library(rethinking)
(Loading rethinking 2.01 also loads rstan 2.19.3, StanHeaders, ggplot2, parallel, and dagitty; the package masks rstudent from package:stats. rstan recommends options(mc.cores = parallel::detectCores()) and rstan_options(auto_write = TRUE).)
Any process that adds together random values from the same distribution converges to a normal. Whatever the average value of the source distribution, each sample from it can be thought of as a fluctuation from the average value.
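A quick way to see this convergence (a base-R sketch; the text's dens function would draw the same picture with a smooth density):

```r
# each value is the sum of 16 random fluctuations in [-1, 1]
pos <- replicate(1e4, sum(runif(16, -1, 1)))
# the histogram is approximately Gaussian, centered on 0
hist(pos, breaks = 50, main = "sum of 16 uniform fluctuations")
```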
prod(1 + runif(12,0,0.1))
[1] 1.881523
growth <- replicate(10000, prod(1+runif(12,0,0.1)))
dens(growth, norm.comp=TRUE)
We get the Gaussian distribution back because multiplying small numbers is approximately the same as adding them. And even multiplicative interactions of large deviations can produce Gaussian distributions, once we measure the outcomes on the log scale, because adding logs is equivalent to multiplying the original numbers.
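To see the log-scale version, we can repeat the growth simulation with large deviations (up to 50% per factor) and take the log of each product (plotted here with base hist; dens(log.big, norm.comp=TRUE) overlays a normal curve for comparison):

```r
# large multiplicative deviations: up to 50% growth per factor
log.big <- replicate(1e4, log(prod(1 + runif(12, 0, 0.5))))
# on the log scale, the distribution is approximately Gaussian
hist(log.big, breaks = 50, main = "log of products of large deviations")
```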
Probability density is the rate of change in cumulative probability.
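In other words, a density function is the derivative of its cumulative distribution function. A quick numerical check of this, using base R's normal density and CDF (illustrative only):

```r
# dnorm (density) should equal the slope of pnorm (cumulative probability)
x <- 1
h <- 1e-5
slope <- (pnorm(x + h) - pnorm(x - h)) / (2 * h)  # central difference
slope      # numerically equal to dnorm(x)
dnorm(x)
```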
First, we recognize a set of variables we wish to understand. Some of these are observable. We call these data. Others are unobservable things like rates and averages. We call these parameters.
For each variable, we define it either in terms of the other variables or in terms of a probability distribution. These definitions make it possible to learn about associations between variables.
The combination of variables and their probability distributions defines a joint generative model that can be used both to simulate hypothetical observations and to analyze real ones.
\(W \sim \text{Binomial}(N, p)\), where W is the observed count of water, N is the total number of tosses, and p is the proportion of water on the globe.
A stochastic relationship is just a mapping of a variable or parameter onto a distribution. It is stochastic because no single instance of a variable on the left is known with certainty.
data("Howell1")
d <- Howell1
str(d)
'data.frame': 544 obs. of 4 variables:
$ height: num 152 140 137 157 145 ...
$ weight: num 47.8 36.5 31.9 53 41.3 ...
$ age : num 63 63 65 41 51 35 32 27 19 54 ...
$ male : int 1 0 0 1 0 1 0 1 0 1 ...
precis(d)
d2 <- d[ d$age >= 18 , ]
Independent and identically distributed:
iid - independent and identically distributed - indicates that each value h has the same probability function, independent of the other h values, and uses the same parameters. It is an epistemological assumption.
de Finetti’s theorem - values which are exchangeable can be approximated by mixtures of i.i.d. distributions. Colloquially, exchangeable values can be reordered.
\(h_i \sim \text{Normal}(\mu, \sigma)\)
\(\mu \sim \text{Normal}(178, 20)\)
\(\sigma \sim \text{Uniform}(0, 50)\)
Whatever your prior, it's a very good idea to plot it, so you have a sense of the assumptions it builds into the model.
The prior for \(\mu\) (the average height) and the prior for \(\sigma\) (the standard deviation) can each be plotted to see what they assume.
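These prior curves can be drawn directly with base R, plotting each density over a sensible range:

```r
# prior for mu: Normal(178, 20)
curve(dnorm(x, 178, 20), from = 100, to = 250, xlab = "mu")
# prior for sigma: Uniform(0, 50)
curve(dunif(x, 0, 50), from = -10, to = 60, xlab = "sigma")
```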
All of the above sets up the prior distributions from which your question is to be answered. By simulating from these priors, you can see what your choices imply about observable height.
sample_mu <- rnorm(1e4, 178, 20)
sample_sigma <- runif(1e4, 0, 50)
prior_h <- rnorm(1e4, sample_mu, sample_sigma)
dens(prior_h)
Prior predictive simulation is very useful for assigning sensible priors, because it can be quite hard to anticipate how priors influence the observable variables.
Take for example a much flatter and less informative prior for \(\mu\), like \(\mu\)~Normal(178,100)
sample_mu <- rnorm( 1e4 , 178 , 100 )
prior_h <- rnorm( 1e4 , sample_mu , sample_sigma )
dens( prior_h )
mu.list <- seq( from=140, to=160 , length.out=200 )
sigma.list <- seq( from=4 , to=9 , length.out=200 )
post <- expand.grid( mu=mu.list , sigma=sigma.list )
post$LL <- sapply( 1:nrow(post) , function(i) sum( dnorm(
d2$height ,
mean=post$mu[i] ,
sd=post$sigma[i] ,
log=TRUE ) ) )
post$prod <- post$LL + dnorm( post$mu , 178 , 20 , TRUE ) + dunif( post$sigma , 0 , 50 , TRUE )
post$prob <- exp( post$prod - max(post$prod) )
contour_xyz( post$mu , post$sigma , post$prob )
image_xyz( post$mu , post$sigma , post$prob )
We now sample from the posterior, much as in the previous chapters. The main difference is that there are now two parameters to estimate, so we sample combinations of them.
sample.rows <- sample( 1:nrow(post) , size=1e4 , replace=TRUE , prob=post$prob )
sample.mu <- post$mu[ sample.rows ]
sample.sigma <- post$sigma[ sample.rows ]
plot( sample.mu , sample.sigma , cex=0.5 , pch=16 , col=col.alpha(rangi2,0.1) )
dens(sample.mu)
dens(sample.sigma)
HPDI(sample.mu)
|0.89 0.89|
153.8693 155.1759
HPDI(sample.sigma)
|0.89 0.89|
7.266332 8.195980
For a Gaussian likelihood and a Gaussian prior on \(\mu\), the posterior distribution of \(\mu\) is always Gaussian as well, regardless of sample size. It is the standard deviation \(\sigma\) that causes the problem.
d3 <- sample( d2$height , size=20 )
mu.list <- seq( from=150, to=170 , length.out=200 )
sigma.list <- seq( from=4 , to=20 , length.out=200 )
post2 <- expand.grid( mu=mu.list , sigma=sigma.list )
post2$LL <- sapply( 1:nrow(post2) , function(i)
sum( dnorm( d3 , mean=post2$mu[i] , sd=post2$sigma[i] ,
log=TRUE ) ) )
post2$prod <- post2$LL + dnorm( post2$mu , 178 , 20 , TRUE ) +
dunif( post2$sigma , 0 , 50 , TRUE )
post2$prob <- exp( post2$prod - max(post2$prod) )
sample2.rows <- sample( 1:nrow(post2) , size=1e4 , replace=TRUE ,
prob=post2$prob )
sample2.mu <- post2$mu[ sample2.rows ]
sample2.sigma <- post2$sigma[ sample2.rows ]
plot( sample2.mu , sample2.sigma , cex=0.5 ,
col=col.alpha(rangi2,0.1) , xlab="mu" , ylab="sigma" , pch=16 )
See how the tail at the top of the cloud is distinctly longer.
Quadratic approximation - a handy way to quickly make inferences about the shape of the posterior.
The posterior's peak will lie at the maximum a posteriori (MAP) estimate.
data("Howell1")
d <- Howell1
d2 <- d[d$age >= 18, ]
flist <- alist(
height ~ dnorm( mu , sigma ) ,
mu ~ dnorm( 178 , 20 ) ,
sigma ~ dunif( 0 , 50 )
)
m4.1 <- quap( flist , data=d2 )
precis(m4.1)
m4.2 <- quap( alist(
    height ~ dnorm( mu , sigma ) ,
    mu ~ dnorm( 178 , 0.1 ) ,
    sigma ~ dunif( 0 , 50 )
) , data=d2 )
precis( m4.2 )
vcov(m4.1)
mu sigma
mu 0.1697383125 0.0002152049
sigma 0.0002152049 0.0849042150
diag( vcov( m4.1 ) )
mu sigma
0.16973831 0.08490421
Variance-Covariance Matrix: the diagonal entries are the variances of each parameter, and the off-diagonal entries are the covariances for each pair. Converting it to a correlation matrix (with cov2cor) bounds each entry between -1 and +1; the 1's on the diagonal indicate each parameter's correlation with itself.
Since the correlations here are near 0, learning about \(\mu\) tells us nothing about \(\sigma\), and vice versa. But this is quite rare more generally.
cov2cor( vcov( m4.1 ) )
mu sigma
mu 1.000000000 0.001792659
sigma 0.001792659 1.000000000
post <- extract.samples(m4.1, n = 1e4)
head(post)
precis(post)
quap posterior: 10000 samples from m4.1
You will find that the mean and standard deviation of each column will be very close to the MAP values from before.
For multivariate sampling, rethinking has a convenient function, extract.samples, which essentially works as follows.
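Under the hood, extract.samples draws from a multivariate normal using the model's MAP means and variance-covariance matrix, i.e. essentially MASS::mvrnorm(n=1e4, mu=coef(m4.1), Sigma=vcov(m4.1)). A self-contained sketch, plugging in the approximate MAP values and vcov entries reported for m4.1 above:

```r
library(MASS)
# equivalent in spirit to: mvrnorm( n=1e4 , mu=coef(m4.1) , Sigma=vcov(m4.1) )
# approximate MAP means for (mu, sigma) and the vcov shown earlier
mu_hat    <- c(154.6, 7.73)
Sigma_hat <- matrix(c(0.1697, 0.0002, 0.0002, 0.0849), nrow = 2)
post <- mvrnorm(n = 1e4, mu = mu_hat, Sigma = Sigma_hat)
```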
plot(d2$height ~ d2$weight)
This just shows the relationship between height and weight in the raw data.
We are going to add an additional piece to the puzzle:
\(h_i \sim \text{Normal}(\mu_i, \sigma)\)
\(\mu_i = \alpha+\beta(x_i-\bar{x})\)
\(\alpha \sim \text{Normal}(178, 20)\)
\(\beta \sim \text{Normal}(0, 10)\)
\(\sigma \sim \text{Uniform}(0, 50)\)
The mean \(\mu\) is no longer a parameter to be estimated. Rather as seen in line two of the model \(\mu_i\) is constructed from other parameters, \(\alpha\) and \(\beta\) and the observed x.
This is not a stochastic relationship - there is no ~ in it, but rather a = in it because the definition of \(\mu_i\) is deterministic.
The value \(x_i\) is just the weight value on row i. It refers to the same individual as the height value, \(h_i\), on the same row. The parameters \(\alpha\) and \(\beta\) are more mysterious.
Where did they come from? We made them up. The parameters \(\mu\) and \(\sigma\) are necessary and sufficient to describe a Gaussian distribution. But \(\alpha\) and \(\beta\) are instead devices we invent for manipulating \(\mu\), allowing it to vary systematically across cases in the data.
One way to understand these made up parameters is to think of them as targets of learning. Each parameter is something that must be described in the posterior distribution.
\[ \mu_i=\alpha+\beta(x_i-\bar{x}) \]
Regression asks two questions about the mean of the outcome: what is the expected height when \(x_i=\bar{x}\) (the answer is the intercept \(\alpha\), since then \(\mu_i=\alpha\)), and what is the change in expected height for a one-unit change in weight (the slope \(\beta\))?
If \(\beta = 0\), then weight has no relationship to height.
Let's simulate to understand:
set.seed(2971)
N <- 100
a <- rnorm(N, 178, 20)
b <- rnorm(N, 0,10)
plot( NULL , xlim=range(d2$weight) , ylim=c(-100,400) , xlab="weight" , ylab="height" )
abline( h=0 , lty=2 )
abline( h=272 , lty=1 , lwd=0.5 )
mtext( "b ~ dnorm(0,10)" )
xbar <- mean(d2$weight)
for ( i in 1:N ) curve( a[i] + b[i]*(x - xbar) ,
from=min(d2$weight) , to=max(d2$weight) , add=TRUE , col=col.alpha("black",0.2) )
b <- rlnorm( 1e4 , 0 , 1 )
dens( b , xlim=c(0,5) , adj=0.1 )
set.seed(2971)
N <- 100
a <- rnorm( N , 178 , 20 )
b <- rlnorm( N , 0 , 1 )
plot( NULL , xlim=range(d2$weight) , ylim=c(-100,400) , xlab="weight" , ylab="height" )
abline( h=0 , lty=2 )
abline( h=272 , lty=1 , lwd=0.5 )
mtext( "log(b) ~ dnorm(0,1)" )
xbar <- mean(d2$weight)
for ( i in 1:N ) curve( a[i] + b[i]*(x - xbar) ,
from=min(d2$weight) , to=max(d2$weight) , add=TRUE , col=col.alpha("black",0.2) )
There is no more a uniquely correct prior than there is a uniquely correct likelihood. Statistical models are machines for inference. Many machines will work, but some work better than others. Priors can be wrong, but only in the same sense that a kind of hammer can be wrong for building a table.
# load data again, since it's a long way back
library(rethinking)
data(Howell1)
d <- Howell1
d2 <- d[ d$age >= 18 , ]
# define the average weight, x-bar
xbar <- mean(d2$weight)
m4.3 <- quap(
alist( height ~ dnorm( mu , sigma ) ,
mu <- a + b*( weight - xbar ) ,
a ~ dnorm( 178 , 20 ) ,
b ~ dlnorm( 0 , 1 ) ,
sigma ~ dunif( 0 , 50 )), data=d2 )
m4.3b <- quap( alist(
    height ~ dnorm( mu , sigma ) ,
    mu <- a + exp(log_b)*( weight - xbar ) ,
    a ~ dnorm( 178 , 100 ) ,
    log_b ~ dnorm( 0 , 1 ) ,
    sigma ~ dunif( 0 , 50 ) ),
    data=d2 )
I emphasize plotting posterior distributions and posterior predictions, instead of attempting to understand a table. Once you become knowledgeable about a particular type of model and kind of data, you'll be able to confidently read tables, at least as long as you remain within a familiar family of model types.
Plotting the implications of your models will allow you to inquire about things that are hard to read from tables.
precis(m4.3)
round(vcov(m4.3),3)
a b sigma
a 0.073 0.000 0.000
b 0.000 0.002 0.000
sigma 0.000 0.000 0.037
plot( height ~ weight , data=d2 , col=rangi2 )
post <- extract.samples( m4.3 )
a_map <- mean(post$a)
b_map <- mean(post$b)
curve( a_map + b_map*(x - xbar) , add=TRUE )
Above we have the mean line of the model, but it does a poor job of communicating uncertainty.
post <- extract.samples(m4.3)
post[1:5,]
N <- 352
dN <- d2[ 1:N , ]
mN <- quap(
    alist(
        height ~ dnorm( mu , sigma ) ,
        mu <- a + b*( weight - mean(weight) ) ,
        a ~ dnorm( 178 , 20 ) ,
        b ~ dlnorm( 0 , 1 ) ,
        sigma ~ dunif( 0 , 50 )
    ) , data=dN )
# extract 20 samples from the posterior
post <- extract.samples( mN , n=20 )
# display raw data and sample size
plot( dN$weight , dN$height ,
xlim=range(d2$weight) , ylim=range(d2$height) ,
col=rangi2 , xlab="weight" , ylab="height" )
mtext(concat("N = ",N))
# plot the lines, with transparency
for ( i in 1:20 )
curve( post$a[i] + post$b[i]*(x-mean(dN$weight)) ,col=col.alpha("black",0.3) , add=TRUE )
post <- extract.samples(m4.3)
mu_at_50 <- post$a + post$b * (50 - xbar)
dens(mu_at_50, col=rangi2, lwd=2, xlab="mu|weight=50")
HPDI( mu_at_50 , prob=0.89 )
|0.89 0.89|
158.5541 159.6593
This is a great start, but we need to repeat the calculation for every weight on the horizontal axis.
mu <- link( m4.3 )
str(mu)
num [1:1000, 1:352] 157 157 157 157 158 ...
# define sequence of weights to compute predictions for
# these values will be on the horizontal axis
weight.seq <- seq( from=25 , to=70 , by=1 )
# use link to compute mu
# for each sample from posterior
# and for each weight in weight.seq
mu <- link( m4.3 , data=data.frame(weight=weight.seq) )
str(mu)
num [1:1000, 1:46] 138 137 137 137 137 ...
# use type="n" to hide raw data
plot( height ~ weight , d2 , type="n" )
for ( i in 1:100 )
points( weight.seq , mu[i,] , pch=16 , col=col.alpha(rangi2,0.1) )
# summarize the distribution of mu
mu.mean <- apply( mu , 2 , mean )
mu.HPDI <- apply( mu , 2 , HPDI , prob=0.89 )
# plot raw data
# fading out points to make line and interval more visible
plot( height ~ weight , data=d2 , col=col.alpha(rangi2,0.5) )
# plot the MAP line, aka the mean mu for each weight
lines( weight.seq , mu.mean )
# plot a shaded region for 89% HPDI
shade( mu.HPDI , weight.seq )
To summarize, here’s the recipe for generating predictions and intervals from the posterior of a fit model.
Use link to generate distributions of posterior values for μ. The default behavior of link is to use the original data, so you have to pass it a list of new horizontal axis values you want to plot posterior predictions across.
Use summary functions like mean or HPDI or PI to find averages and lower and upper bounds of μ for each value of the predictor variable.
Finally, use plotting functions like lines and shade to draw the lines and intervals. Or you might plot the distributions of the predictions, or do further numerical calculations with them. It's really up to you.
The function link is not really very sophisticated. All it is doing is using the formula you provided when you fit the model to compute the values of the linear model for each posterior sample.
post <- extract.samples(m4.3)
mu.link <- function(weight) post$a + post$b*( weight - xbar )
weight.seq <- seq( from=25 , to=70 , by=1 )
mu <- sapply( weight.seq , mu.link )
mu.mean <- apply( mu , 2 , mean )
mu.HPDI <- apply( mu , 2 , HPDI , prob=0.89 )
4.4.3.5 Prediction intervals
sim.height <- sim( m4.3 , data=list(weight=weight.seq) )
str(sim.height)
num [1:1000, 1:46] 144 130 134 136 138 ...
height.PI <- apply( sim.height , 2 , PI , prob=0.89 )
# plot raw data
plot( height ~ weight , d2 , col=col.alpha(rangi2,0.5) )
# draw MAP line
lines( weight.seq , mu.mean )
# draw HPDI region for line
shade( mu.HPDI , weight.seq )
# draw PI region for simulated heights
shade( height.PI , weight.seq )
sim.height <- sim( m4.3 , data=list(weight=weight.seq) , n=1e4 )
height.PI <- apply( sim.height , 2 , PI , prob=0.89 )
# plot raw data
plot( height ~ weight , d2 , col=col.alpha(rangi2,0.5) )
# draw MAP line
lines( weight.seq , mu.mean )
# draw HPDI region for line
shade( mu.HPDI , weight.seq )
# draw PI region for simulated heights
shade( height.PI , weight.seq )
Rolling our own sim:
post <- extract.samples(m4.3)
weight.seq <- 25:70
sim.height <- sapply( weight.seq , function(weight)
rnorm(
n=nrow(post) ,
mean=post$a + post$b*( weight - xbar ) ,
sd=post$sigma ) )
height.PI <- apply( sim.height , 2 , PI , prob=0.89 )
The most common polynomial regression is a parabolic model of the mean:
\[ \mu_i=\alpha+\beta_1x_i+\beta_2x_i^2 \]
It's the same linear function of x as in a linear regression, just with a little "1" subscript added to the parameter name.
Approximating the posterior is straightforward.
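The prediction code below assumes a fitted parabolic model m4.5 and standardized weights, neither of which appears in this excerpt. A sketch of the standardization and quap call, following the same pattern as the earlier models (the b2 prior here is an assumption):

```r
library(rethinking)
data(Howell1)
d <- Howell1
# standardize weight (mean 0, sd 1) and precompute its square
d$weight_s  <- ( d$weight - mean(d$weight) ) / sd(d$weight)
d$weight_s2 <- d$weight_s^2
m4.5 <- quap(
    alist(
        height ~ dnorm( mu , sigma ) ,
        mu <- a + b1*weight_s + b2*weight_s2 ,
        a ~ dnorm( 178 , 20 ) ,
        b1 ~ dlnorm( 0 , 1 ) ,
        b2 ~ dnorm( 0 , 1 ) ,
        sigma ~ dunif( 0 , 50 )
    ) , data=d )
```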
weight.seq <- seq( from=-2.2 , to=2 , length.out=30 )
pred_dat <- list( weight_s=weight.seq , weight_s2=weight.seq^2 )
mu <- link( m4.5 , data=pred_dat )
mu.mean <- apply( mu , 2 , mean )
mu.PI <- apply( mu , 2 , PI , prob=0.89 )
sim.height <- sim( m4.5 , data=pred_dat )
height.PI <- apply( sim.height , 2 , PI , prob=0.89 )
Quadratic fit:
plot( height ~ weight_s , d , col=col.alpha(rangi2,0.5) )
lines( weight.seq , mu.mean )
shade( mu.PI , weight.seq )
shade( height.PI , weight.seq )
d$weight_s3 <- d$weight_s^3
m4.6 <- quap(
    alist(
        height ~ dnorm( mu , sigma ) ,
        mu <- a + b1*weight_s + b2*weight_s2 + b3*weight_s3 ,
        a ~ dnorm( 178 , 20 ) ,
        b1 ~ dlnorm( 0 , 1 ) ,
        b2 ~ dnorm( 0 , 10 ) ,
        b3 ~ dnorm( 0 , 10 ) ,
        sigma ~ dunif( 0 , 50 )
    ), data=d )
Cubic
plot( height ~ weight_s , d , col=col.alpha(rangi2,0.5) , xaxt="n" )
at <- c(-2,-1,0,1,2)
labels <- at*sd(d$weight) + mean(d$weight)
axis( side=1 , at=at , labels=round(labels,1) )
data("cherry_blossoms")
d <- cherry_blossoms
precis(d)
Basis function: \[ \mu_i=\alpha+w_1B_{i,1}+w_2B_{i,2}+w_3B_{i,3}+\dots \]
\(B_{i,b}\) is the b-th basis function's value on row i, and the w parameters are the corresponding weights for each. The parameters act like slopes, adjusting the influence of each basis function on the mean \(\mu_i\). So really this is just another linear regression, but with some fancy, synthetic predictor variables. These synthetic variables do some really elegant descriptive geocentric work.
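The knot_list printed below, and the basis matrix B used by m4.7, are never constructed in this excerpt. A sketch of the standard construction, assuming the cherry_blossoms data loaded above and 15 knots placed at quantiles of year:

```r
library(splines)
# keep only rows with a recorded temperature
d2 <- d[ complete.cases(d$temp) , ]
# place 15 knots at evenly spaced quantiles of year
num_knots <- 15
knot_list <- quantile( d2$year , probs=seq(0, 1, length.out=num_knots) )
# cubic B-spline basis; pass interior knots only (bs adds the boundary knots)
B <- bs( d2$year ,
         knots=knot_list[ -c(1, num_knots) ] ,
         degree=3 , intercept=TRUE )
```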
knot_list
0% 7.142857% 14.28571% 21.42857% 28.57143% 35.71429% 42.85714% 50% 57.14286% 64.28571%
839.0000 937.2143 1017.4286 1097.6429 1177.8571 1258.0714 1338.2857 1418.5000 1498.7143 1578.9286
71.42857% 78.57143% 85.71429% 92.85714% 100%
1659.1429 1739.3571 1819.5714 1899.7857 1980.0000
m4.7 <- quap(
alist(
T ~ dnorm( mu , sigma ) ,
mu <- a + B %*% w ,
a ~ dnorm(6,10),
w ~ dnorm(0,1),
sigma ~ dexp(1)
),
data=list( T=d2$temp , B=B ) ,
start=list( w=rep( 0 , ncol(B) ) ) )
post <- extract.samples(m4.7)
w <- apply( post$w , 2 , mean )
plot( NULL , xlim=range(d2$year) , ylim=c(-2,2) ,
xlab="year" , ylab="basis * weight" )
for ( i in 1:ncol(B) ) lines( d2$year , w[i]*B[,i] )
mu <- link( m4.7 )
mu_PI <- apply(mu,2,PI,0.97)
plot( d2$year , d2$temp , col=col.alpha(rangi2,0.3) , pch=16 )
shade( mu_PI , d2$year , col=col.alpha("black",0.5) )